Summarizing and Searching Sequential Semistructured Sources

Authors

  • Roy Goldman
  • Jennifer Widom
Abstract

XML, the eXtensible Markup Language [XML97], is fast becoming the de facto representation for semistructured data. In the research community, initial work on semistructured databases was based on simple graph-based data models such as the Object Exchange Model (OEM) [PGMW95]. Though XML and OEM are similar, there are some differences [DFF99, GMW99], and one of the most significant of these concerns data ordering. OEM and other original semistructured data models are set-based: an object has a set of subobjects. However, since XML is a textual representation, any XML document specifies order inherently: an element has a list of subelements. Of course, some applications may treat order as an irrelevant artifact of the serialization "forced" by an XML representation. Still, we cannot preclude XML content authors from taking advantage of order. For example, a publications database in XML may represent a publication's author ordering simply by using an ordered list of Author subelements under each Publication. As researchers have adapted their work on semistructured data to XML, the issue of order has already been addressed at the data model and query language level [DFF99, GMW99]. In this paper, we focus on the impact of ordered subelements on two important technologies associated with semistructured data: DataGuides [GW97] and proximity search [GSVGM98].

A DataGuide is a concise, accurate structural summary of a semistructured database [GW97]. DataGuides are constructed and maintained dynamically from a database, and they have proved useful for a variety of purposes: browsing, query formulation, storing statistics, query optimization, and most recently compression of XML data [LS00]. DataGuides were defined originally in the context of the OEM model: DataGuides summarize unordered OEM databases, and a DataGuide is itself an unordered OEM object.
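To make the idea of an unordered structural summary concrete, the following is a minimal sketch, not the construction from [GW97] itself. It assumes the database is given as a dictionary mapping each node to a list of (label, child) pairs, and it groups nodes reachable by the same label path in the style of a subset construction, so that each distinct label path appears exactly once in the summary. The function name `build_dataguide` and the sample publication data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_dataguide(edges, root):
    """Summarize an unordered labeled graph (illustrative sketch only).

    edges: dict mapping a node to a list of (label, child) pairs.
    Returns (start, guide), where each guide node is the frozenset of
    source nodes reachable by one label path, and guide[node] maps each
    outgoing label to the guide node it leads to.
    """
    start = frozenset([root])
    guide = {}
    pending = [start]
    while pending:
        targets = pending.pop()
        if targets in guide:
            continue  # this group of source nodes is already summarized
        by_label = defaultdict(set)
        for node in targets:
            for label, child in edges.get(node, []):
                by_label[label].add(child)
        guide[targets] = {
            label: frozenset(children) for label, children in by_label.items()
        }
        pending.extend(guide[targets].values())
    return start, guide


# Hypothetical publications database: two publications with differing structure.
edges = {
    "db": [("Publication", "p1"), ("Publication", "p2")],
    "p1": [("Title", "t1"), ("Author", "a1"), ("Author", "a2")],
    "p2": [("Title", "t2"), ("Author", "a3"), ("Year", "y2")],
}

start, guide = build_dataguide(edges, "db")
publication = guide[start]["Publication"]
print(sorted(guide[publication]))  # labels found under any Publication
```

Because the children of each node are collected into sets, the sketch discards subelement order entirely; the two repeated Author edges under one publication collapse into a single Author entry in the summary.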
It is straightforward to use our original DataGuide algorithms to create and maintain unordered XML DataGuides over XML data. However, capturing order in XML DataGuides introduces some new and interesting issues. In this paper we present and evaluate several approaches for creating ordered XML DataGuides that effectively summarize the order in the original XML data.

Proximity search is a concept from information retrieval (IR) that we applied to searching graph-structured databases [GSVGM98]. In a traditional IR setting, proximity search is typically implemented with a Near operator and is effective for identifying documents that contain multiple keywords in close proximity, where distance is defined based on the number of characters separating the keywords. But when structure is present, textual nearness is not always appropriate. For example, in an XML publication list an author subelement for a publication could be textually closer to the following publication's title than to its own title. Our proximity search approach takes structure into account by instead considering shortest paths in graph representations of the data. We build special indexes for this purpose, and experience indicates that our approach does an effective job of capturing what proximity search should mean in a structured or semistructured database [GSVGM98]. As with our DataGuide work, however, our proximity search work was based on an unordered data model. In this paper we show how to modify proximity search to incorporate the inherent order of XML subelements. Specifically, we show how to augment the graph representation of XML data such that shortest path computations account for subelement order. We demonstrate the impact of our changes in a sample scenario where subelement order is clearly relevant to proximity search.

Throughout this paper we assume the mapping of XML data to an ordered labeled graph as specified in [GMW99].
In brief, each element and each attribute maps to a node in the (rooted) graph, and edges cor-
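The effect of subelement order on shortest-path proximity can be illustrated with a small sketch. The excerpt does not say how the graph is augmented, so the code below makes its own assumptions: it adds an edge between adjacent sibling elements, assigns parent-child edges a cost of 2 and sibling edges a cost of 1, and ignores attributes. The names `build_graph` and `distance`, the edge costs, and the sample document are all hypothetical.

```python
import heapq
import xml.etree.ElementTree as ET


def build_graph(root, child_cost=2, sibling_cost=None):
    """Map XML elements to an undirected weighted graph (attributes ignored).

    If sibling_cost is given, adjacent subelements are also linked, which is
    an assumed way of letting shortest paths reflect subelement order.
    """
    nodes = list(root.iter())  # elements in document order
    index = {element: i for i, element in enumerate(nodes)}
    graph = {i: [] for i in range(len(nodes))}

    def link(a, b, cost):
        graph[index[a]].append((index[b], cost))
        graph[index[b]].append((index[a], cost))

    for parent in nodes:
        children = list(parent)
        for child in children:
            link(parent, child, child_cost)
        if sibling_cost is not None:
            for left, right in zip(children, children[1:]):
                link(left, right, sibling_cost)
    return nodes, graph


def distance(graph, source, target):
    """Dijkstra shortest-path cost between two node ids, or None if unreachable."""
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best[node]:
            continue  # stale heap entry
        for neighbor, cost in graph[node]:
            candidate = dist + cost
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return None


doc = ET.fromstring(
    "<Publication><Title>T</Title>"
    "<Author>A</Author><Author>B</Author><Author>C</Author></Publication>"
)

nodes, unordered = build_graph(doc)
_, ordered = build_graph(doc, sibling_cost=1)
by_text = {element.text: i for i, element in enumerate(nodes) if element.text}

print(distance(unordered, by_text["A"], by_text["C"]))
print(distance(ordered, by_text["A"], by_text["B"]))
print(distance(ordered, by_text["A"], by_text["C"]))
```

In the unordered graph every pair of authors is equally distant, since each path goes through the shared Publication node. With the assumed sibling edges, the first author is closer to the second author than to the third, so a proximity ranking can reflect author order.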


Similar resources

Publication IX

We model dependencies between m multivariate continuous-valued information sources by a combination of (i) a generalized canonical correlations analysis (gCCA) to reduce dimensionality while preserving dependencies in m − 1 of them, and (ii) summarizing dependencies with the remaining one by associative clustering. This new combination of methods avoids multiway associative clustering which wou...


A Semantic Portal for Fund Finding in the EU: Semantic Upgrade, Integration and Publication of Heterogeneous Legacy Data

FundFinder is a Semantic Web portal that allows searching for and navigating through information about funding opportunities. This application has been created following a set of techniques and using a set of tools for the upgrade of legacy content to the Semantic Web, including databases and semistructured documents. This process consists in extracting and populating knowledge from heterogeneo...


Representative Objects: Concise Representations of Semistructured, Hierarchical Data

In this paper we introduce the representative object, which uncovers the inherent schema(s) in semistructured, hierarchical data sources and provides a concise description of the structure of the data. Semistructured data, unlike data stored in typical relational or object-oriented databases, does not have a fixed schema that is known in advance and stored separately from the data. With the rapid...


Exploratory modeling of yeast stress response and its regulation with gCCA and associative clustering

We model dependencies between m multivariate continuous-valued information sources by a combination of (i) a generalized canonical correlations analysis (gCCA) to reduce dimensionality while preserving dependencies in m - 1 of them, and (ii) summarizing dependencies with the remaining one by associative clustering. This new combination of methods avoids multiway associative clustering which wou...


Simulation Unification: Beyond Querying Semistructured Data

This article first recalls simulation unification, a non-standard unification proposed at the 18th International Conference on Logic Programming (ICLP 2002) for making logic programming capable of querying semistructured data on the Web. This article further argues that, beyond querying semistructured data on the Web, simulation unification has a potential for Web querying of multimedia data...




Publication date: 2000